The Automatic Content Extraction (ACE) program

Authors

  • George Doddington
  • Alexis Mitchell
  • Mark Przybocki
  • Lance Ramshaw
  • Stephanie Strassel
Abstract

The objective of the ACE program is to develop technology to automatically infer from human language data the entities being mentioned, the relations among these entities that are directly expressed, and the events in which these entities participate. Data sources include audio and image data in addition to pure text, and Arabic and Chinese in addition to English. The effort involves defining the research tasks in detail, collecting and annotating data needed for training, development, and evaluation, and supporting the research with evaluation tools and research workshops. This program began with a pilot study in 1999. The next evaluation is scheduled for September 2004.

Introduction and Background

Today’s global web of electronic information, including most notably the www, provides a resource of unbounded information-bearing potential. But to fully exploit this potential requires the ability to extract content from human language automatically. That is the objective of the ACE program – to develop the capability to extract meaning from multimedia sources. These sources include text, audio and image data.

The ACE program is a “technocentric” research effort, meaning that the emphasis is on developing core enabling technologies rather than solving the application needs that motivate the research. The program began in 1999 with a study intended to identify those key content extraction tasks to serve as the research targets for the remainder of the program. These tasks were identified in general as the extraction of the entities, relations and events being discussed in the language.

In general objective, the ACE program is motivated by and addresses the same issues as the MUC program that preceded it (NIST 1999). The ACE program, however, attempts to take the task “off the page” in the sense that the research objectives are defined in terms of the target objects (i.e., the entities, the relations, and the events) rather than in terms of the words in the text. For example, the so-called “named entity” task, as defined in MUC, is to identify those words (on the page) that are names of entities. In ACE, on the other hand, the corresponding task is to identify the entity so named. This is a different task, one that is more abstract and that involves inference more explicitly in producing an answer. In a real sense, the task is to detect things that “aren’t there”. Reference resolution thus becomes an integral and critical part of solving the problem.

During the period 2000-2001, the ACE effort was devoted solely to entity detection and tracking. During the period 2002-2003, relations were explored and added. Now, starting in 2004, events are being explored and added as the third of the three original tasks.

1 While the ACE program is directed toward extraction of information from audio and image sources in addition to pure text, the research effort is restricted to information extraction from text. The actual transduction of audio and image data into text is not part of the ACE research effort, although the processing of ASR and OCR output from such transducers is.
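The contrast drawn above between the MUC "named entity" task (find the words that are names) and the ACE task (find the entity so named) can be illustrated with a small data-structure sketch. This is only an illustration under stated assumptions: the class names `Mention` and `Entity`, the helper `find_mention`, the mention-kind labels, the entity type label, and the example sentence are all hypothetical and are not taken from the ACE task definitions or any ACE tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Mention:
    """A span of text referring to some entity (character offsets, end exclusive)."""
    start: int
    end: int
    text: str
    kind: str  # hypothetical labels for this sketch: "name", "nominal", "pronoun"


@dataclass
class Entity:
    """The thing being referred to, together with every mention that refers to it."""
    entity_id: str
    entity_type: str  # illustrative label only, e.g. "PERSON"
    mentions: list = field(default_factory=list)


def find_mention(text, phrase, kind):
    """Locate the first occurrence of `phrase` in `text` and wrap it as a Mention."""
    start = text.index(phrase)
    return Mention(start, start + len(phrase), phrase, kind)


# Invented example sentence, used only for illustration.
text = "Maria Lopez joined the board in May. The executive said she would stay."

# Word-level view: only the words on the page that are names.
name_spans = [find_mention(text, "Maria Lopez", "name")]

# Entity-level view: one entity, whose mentions are linked by reference resolution.
# The linking is written by hand here; a real system would have to infer it.
person = Entity("E1", "PERSON", [
    find_mention(text, "Maria Lopez", "name"),
    find_mention(text, "The executive", "nominal"),
    find_mention(text, "she", "pronoun"),
])

print(len(name_spans), len(person.mentions))
```

In this sketch the word-level view yields a single span, while the entity-level view groups three different surface forms under one object, which is why reference resolution becomes part of the task itself.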


Similar resources

Annotation of Semantic Relations in Patent Documents

This paper presents the theoretical bases and quantitative results of an activity consisting in manually annotating part-whole and motion relations in patent documents. The aim of this activity was creating a gold standard for the evaluation of an automatic relation extraction tool developed by FBK-irst within the PATExpert project. For this purpose, we took the annotation scheme created for th...


Linguistic Resources and Evaluation Techniques for Evaluation of Cross-Document Automatic Content Extraction

The NIST Automatic Content Extraction (ACE) Evaluation expands its focus in 2008 to encompass the challenge of cross-document and cross-language global integration and reconciliation of information. While past ACE evaluations were limited to local (within-document) detection and disambiguation of entities, relations and events, the current evaluation adds global (cross-document and cross-langua...


The Automatic Content Extraction (ACE) Program - Tasks, Data, and Evaluation

The objective of the ACE program is to develop technology to automatically infer from human language data the entities being mentioned, the relations among these entities that are directly expressed, and the events in which these entities participate. Data sources include audio and image data in addition to pure text, and Arabic and Chinese in addition to English. The effort involves defining t...


Combining Lexical, Syntactic, and Semantic Features with Maximum Entropy Models for Information Extraction

Extracting semantic relationships between entities is challenging because of a paucity of annotated data and the errors induced by entity detection modules. We employ Maximum Entropy models to combine diverse lexical, syntactic and semantic features derived from the text. Our system obtained competitive results in the Automatic Content Extraction (ACE) evaluation. Here we present our general ap...


Unsupervised Information Extraction Approach Using Graph Mutual Reinforcement

Information Extraction (IE) is the task of extracting knowledge from unstructured text. We present a novel unsupervised approach for information extraction based on graph mutual reinforcement. The proposed approach does not require any seed patterns or examples. Instead, it depends on redundancy in large data sets and graph based mutual reinforcement to induce generalized “extraction patterns”....



Publication date: 2004